# Multimodal input

Mistral Small 3.2 24B Instruct 2506 GGUF
Apache-2.0
Mistral Small 3.2 24B Instruct 2506 is a multilingual large language model that supports text and image input and text output, with a context length of 128k.
Image-to-Text Supports Multiple Languages
lmstudio-community
5,588
1
Gemma 3n E2B It
Gemma 3n is a family of lightweight, state-of-the-art open multimodal models from Google, built on the same research and technology as the Gemini models. It accepts text, audio, and visual inputs and is suitable for a variety of tasks.
Image-to-Text Transformers
google
1,183
26
Qwen2.5 Omni 7B GGUF
Other
Qwen2.5-Omni-7B-GGUF is the GGUF-format version of the Qwen2.5-Omni-7B model. It supports multimodal inputs including text, audio, and images.
Large Language Model English
ggml-org
319
3
Qwen2.5 Omni 3B GGUF
Other
Qwen2.5-Omni-3B is a multimodal model that supports text, audio, and image input, but does not support video input or audio generation.
Large Language Model English
ggml-org
126
1
DAM 3B Video
Other
DAM-3B-Video is a 3-billion-parameter vision-language model capable of generating fine-grained local descriptions for user-specified image/video regions.
Image-to-Text Safetensors English
nvidia
426
42
Stable Diffusion 3.5 Large Controlnet Canny
Other
A Canny edge-detection ControlNet adapted for the Stable Diffusion 3.5 Large model, used for precise control of the image generation process.
Image Generation English
stabilityai
737
10
LTX Video
Other
The first DiT-based video generation model capable of generating high-quality videos in real time. It supports two scenarios: text-to-video and image + text-to-video.
Text-to-Video English
Lightricks
165.42k
1,174
Diva Llama 3 V0 8b
DiVA Llama 3 is an end-to-end voice assistant model capable of processing both speech and text inputs, trained using distillation loss.
Text-to-Audio Transformers
WillHeld
2,596
34